Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: standardize repo structure and other prep for open-sourcing #60

Merged
merged 13 commits into from
Oct 20, 2023

Conversation

hkeeler
Copy link
Member

@hkeeler hkeeler commented Oct 17, 2023

This PR is a bit of a grab bag of tune-up in prep for open-sourcing this repo. It includes:

  1. Structuring the repo to be more compliant with modern Python projects.
    1. Move tests out to top-level directory.
    2. Rename src/validator to regtech_data_validator.
  2. Consolidate external datasource transform code and source data under top-level data directory.
    1. Moved config.py settings into their respective scripts, and file paths are now passed in as CLI args instead.
  3. Move processed CSV files into the project itself. This allows for simpler data lookups via package name via importlib.resources. This allowed the removal of the ROOT_PATH Python path logic in all of the __init__.pys.
  4. Refactor global_data.py to load data only once where module is first imported.
  5. Refactor SBLCheck's
    1. warning: bool for a more explicit severity, backed by an enum that only allows ERROR and WARNING.
      1. Several of the warning-level validations were not setting warning=True, and were thus defaulting to False. This will prevent that. I also fixed all these instances.
      2. Removes the need for translation to severity when building JSON output.
    2. Use explicit args in the constructor, and pass all shared args on to parent class.
      This removes the need for the arg name/id error handling.
  6. Switch CLI output from Python dict to JSON.
  7. Rollback black version used in linting Action due to bug in latest version.

Note: Some of the files that I both moved and changed seem to now show as having deleted the old file and created a new one. I'm not sure why it's doing this. I did the moves and changes in separate commits, which usually prevents this, but doesn't seem to be the case here. Perhaps there's just so much change in some that git considers it a whole new file? 🤷 It's kind of annoying, especially if it results in losing git history for those files.

- Fixed all imports
- Fixed test and coverage settings in pyproject.toml
- Removed all Python path magic in __init.py__ files
- Moved data files into the repo, and used `importlib` to load files by
  package name instead of path. This is more portable, especially once
  we turn this into a distributable package.
- Refactored global_data to only load data once at module load time
- Move file format settings from config.py into respective data
  transform scripts
- Move src/dest file settings from config.py to CLI args
- Use consistent CLI arg and file exists handling
- Add openpyxl dependency for handling NAICS Excel reading
- Use required enum-based `severity` arg over boolean `warning`
  with default to false. This default is likely partially the cause of
  several warning-level validations being  set to error.
- Remove `test_checks.py` as those tests are no longer needed with
  refactored `SBLCheck`.
- Refactor JSON output to use `severity` enum value
- Refactor exception handling for Pandera's `SchemaErrors`
@hkeeler hkeeler requested a review from aharjati October 17, 2023 07:29
@github-actions
Copy link

github-actions bot commented Oct 17, 2023

Coverage report

The coverage rate went from 93.97% to 85.13% ⬇️
The branch rate is 82%.

81.25% of new lines are covered.

Diff Coverage details (click to unfold)

regtech_data_validator/checks.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/create_schemas.py

90% of new lines are covered (93.24% of the complete file).
Missing lines: 57, 62

regtech_data_validator/global_data.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/main.py

0% of new lines are covered (0% of the complete file).
Missing lines: 8, 13, 20, 26, 27, 29, 32, 34, 49, 50

regtech_data_validator/phase_validations.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/schema_template.py

100% of new lines are covered (100% of the complete file).

@@ -16,6 +16,9 @@ pytest-cov = "4.1.0"
black = "23.3.0"
ruff = "0.0.259"

[tool.poetry.group.data.dependencies]
openpyxl = "^3.1.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used? I may have missed the usage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's required by the NAICS code processing script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that detail to the new README I added for that dataset. Each of those could use instructions on how to run those two scripts too.

@aharjati
Copy link
Contributor

this file: .../regtech-data-validator/.devcontainer/devcontainer.json
needs to be updated to use new tests path :

"python.testing.pytestArgs": [
          "--rootdir",
          "${workspaceFolder}/tests"
        ]

@hkeeler
Copy link
Member Author

hkeeler commented Oct 18, 2023

@aharjati, I haven't been running this in the DevContainer setup. What's the best way to test the fix? Is it basically just that the built-in VSCode test runner works as-expected?

@hkeeler
Copy link
Member Author

hkeeler commented Oct 18, 2023

I think we're good. This seems to work now...

image

@hkeeler
Copy link
Member Author

hkeeler commented Oct 18, 2023

...and now black seems to be failing on the PR Check, though it doesn't seem related to the code itself. Maybe there's something going on environment-wise with GitHub Actions at the moment? I tried kicking the Action back off manually, but it failed the same way. I'll try again later. 😕

@aharjati
Copy link
Contributor

...and now black seems to be failing on the PR Check, though it doesn't seem related to the code itself. Maybe there's something going on environment-wise with GitHub Actions at the moment? I tried kicking the Action back off manually, but it failed the same way. I'll try again later. 😕

we many need to update this list:
https://github.com/cfpb/regtech-data-validator/blob/pre-open-source/pyproject.toml#L39

@hkeeler hkeeler requested a review from aharjati October 20, 2023 06:45
@hkeeler
Copy link
Member Author

hkeeler commented Oct 20, 2023

The build is back in black 🤘 to ✅ . Looks like there's a bug in the latest version (psf/black#3953). I pinned it to the last major version, and it's happy again. We can roll it forward once they have a fix up for that.

@aharjati
Copy link
Contributor

The build is back in black 🤘 to ✅ . Looks like there's a bug in the latest version (psf/black#3953). I pinned it to the last major version, and it's happy again. We can roll it forward once they have a fix up for that.

Thanks!

Copy link
Contributor

@aharjati aharjati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hkeeler hkeeler changed the title [WIP] refactor: standardize repo structure and other prep for open-sourcing refactor: standardize repo structure and other prep for open-sourcing Oct 20, 2023
@hkeeler hkeeler merged commit ba6a1c4 into main Oct 20, 2023
@hkeeler hkeeler deleted the pre-open-source branch October 20, 2023 23:56
jcadam14 pushed a commit that referenced this pull request May 3, 2024
…#60)

Grab bag of tune-up in prep for open-sourcing this repo.

1. Restructure repo to be more compliant with modern Python projects.
    1. Move `tests` out to top-level directory.
    2. Rename `src/validator` to `regtech_data_validator`.
2. Consolidate external datasource code and data to `data` dir.
    1. Move `config.py` settings into their respective scripts, and file
        paths are now passed in as CLI args instead.
3. Move processed CSV files into the project itself. This allows for
    simpler data lookups via package name via `importlib.resources`.
    This allowed the removal of the `ROOT_PATH` Python path logic in all of
    the `__init__.py`s.
4. Refactor `global_data.py` to load data only once where module is
    first imported.
5. Refactor `SBLCheck`'s
    1. `warning: bool` for a more explicit `severity`, backed by an enum
        that only allows `ERROR` and `WARNING`.
        1. Several of the warning-level validations were not setting
            `warning=True`, and were thus defaulting to `False`. This will prevent
            that. I also fixed all these instances.
        2. Removes the need for translation to `severity` when building JSON
            output.
    2. Use explicit args in the constructor, and pass all shared args on to
        parent class. This removes the need for the arg `name`/`id` error handling. 
6. Switch CLI output from Python dict to JSON.
7. Rollback `black` version used in linting Action due to bug in latest version.
    -  psf/black#3953

**Note:** Some of the files that I both moved _and_ changed seem to now
show as having deleted the old file and created a new one. I'm not sure
why it's doing this. I did the moves and changes in separate commits,
which usually prevents this, but doesn't seem to be the case here.
Perhaps there's just so much change in some that git considers it a
whole new file? 🤷 It's kind of annoying, especially if it results in
losing git history for those files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants